{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7a2f1e2d-ce46-4a23-874a-4947767b927e",
   "metadata": {},
   "source": [
    "# Preparing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cf6c001-8db2-4039-9acd-959e5228c73f",
   "metadata": {},
   "source": [
    "## Code Description\n",
    "\n",
    "The code cell below processes two datasets, the `graph_2023` edge list and the `humans` table, in four steps:\n",
    "\n",
    "- Load Data:\n",
    "    - Reads the `graph_2023` dataset containing edges of a graph from a CSV file.\n",
    "    - Reads the `humans` dataset containing Wikipedia titles and gender information from a CSV file.\n",
    "\n",
    "- Preprocess Data:\n",
    "    - Converts all text in the `Source` and `Target` columns of `graph_2023` to lowercase.\n",
    "    - Converts all text in the `en_wikipedia_title` column of `humans` to lowercase and strips any leading or trailing whitespace.\n",
    "\n",
    "- Filter Graph Data:\n",
    "    - Creates a set of Wikipedia titles from the `humans` dataset.\n",
    "    - Filters `graph_2023` to keep only the edges where both `Source` and `Target` are in the set of Wikipedia titles.\n",
    "    - Further restricts the edges to those whose `Target` also appears among the `Source` titles.\n",
    "\n",
    "- Merge Data:\n",
    "    - Merges the filtered graph data with the `humans` dataset to add a gender column based on the `Source` title.\n",
    "    - Cleans up the gender column by removing unwanted characters."
   ]
  },
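  {
   "cell_type": "markdown",
   "id": "b3e1c4a2-5f6d-4e7a-9c1b-2d3e4f5a6b7c",
   "metadata": {},
   "source": [
    "The steps listed above can be sketched with pandas on toy data. This is a minimal illustration, not the implementation of `build_connecting_people_graph`: the sample rows, the `gender` column name, the left join, and taking the `Source` set from the already filtered edges are all assumptions. The gender clean-up step is left out because the description does not say which characters are removed.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for the two CSV files (hypothetical rows, not the real data).\n",
    "graph_2023 = pd.DataFrame({\n",
    "    'Source': ['Ada Lovelace', 'Alan Turing', 'Ada Lovelace', 'Grace Hopper', 'Alan Turing'],\n",
    "    'Target': ['Alan Turing', 'Ada Lovelace', 'London', 'Alan Turing', 'Marie Curie'],\n",
    "})\n",
    "humans = pd.DataFrame({\n",
    "    'en_wikipedia_title': [' Ada Lovelace', 'Alan Turing ', 'Grace Hopper', 'Marie Curie'],\n",
    "    'gender': ['female', 'male', 'female', 'female'],  # column name is an assumption\n",
    "})\n",
    "\n",
    "# Preprocess: lowercase the edge columns; lowercase and strip the titles.\n",
    "graph_2023['Source'] = graph_2023['Source'].str.lower()\n",
    "graph_2023['Target'] = graph_2023['Target'].str.lower()\n",
    "humans['en_wikipedia_title'] = humans['en_wikipedia_title'].str.lower().str.strip()\n",
    "\n",
    "# Filter 1: keep edges whose Source and Target are both human titles.\n",
    "titles = set(humans['en_wikipedia_title'])\n",
    "edges = graph_2023[graph_2023['Source'].isin(titles) & graph_2023['Target'].isin(titles)]\n",
    "\n",
    "# Filter 2: keep edges whose Target also occurs as a Source\n",
    "# (assumption: the Source set is taken from the edges kept by filter 1).\n",
    "edges = edges[edges['Target'].isin(set(edges['Source']))]\n",
    "\n",
    "# Merge: attach the gender of the Source person, then drop the join key.\n",
    "edges = edges.merge(\n",
    "    humans[['en_wikipedia_title', 'gender']],\n",
    "    left_on='Source',\n",
    "    right_on='en_wikipedia_title',\n",
    "    how='left',\n",
    ").drop(columns='en_wikipedia_title')\n",
    "\n",
    "print(edges)\n",
    "```\n",
    "\n",
    "On this toy input the edge to `london` is dropped because it is not a human title, and the edge to `marie curie` is dropped because that title never appears as a `Source`, which leaves three edges."
   ]
  },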
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "acdff306",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Set DATA_ROOT for the current process (machine-specific path; adjust as needed)\n",
    "os.environ[\"DATA_ROOT\"] = \"D:/Users/Paschalis/phd/data/\"\n",
    "\n",
    "repo_root = Path.cwd().parent\n",
    "# Put the repository's src directory on the import path\n",
    "sys.path.insert(0, str(repo_root / \"src\"))\n",
    "\n",
    "from connecting_people.graph.build_graph import build_connecting_people_graph\n",
    "data_root = Path(os.environ[\"DATA_ROOT\"])\n",
    "G = build_connecting_people_graph(\n",
    "    input_file_path=data_root,\n",
    "    output_file_path=data_root / \"connecting_people2/bio_to_bio_df2.parquet\",\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "bio2bio",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
